
add results for Granite Embedding English R2 models #256

Merged
KennethEnevoldsen merged 1 commit into embeddings-benchmark:main from aashka-trivedi:main on Aug 18, 2025

Conversation

@aashka-trivedi
Contributor

Adds results for Granite Embedding English R2 models

Checklist

  • My model has a model sheet, report or similar
  • My model has a reference implementation in mteb/models/; this can be an API-based implementation. Instructions on how to add a model can be found here
    • No, but there is an existing PR: 3050
    • The results submitted were obtained using the reference implementation (see the sketch after this checklist)
  • My model is available, either as a publicly accessible API or publicly on, e.g., Hugging Face
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.
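
For reference, results in this format are typically produced with the `mteb` Python package. Below is a minimal sketch of such a run, assuming the checkpoint loads as a plain `sentence-transformers` model; the reference implementation in `mteb/models/` (PR 3050) may wrap the model differently, so treat this as illustrative rather than the exact setup used for this submission.

```python
# Minimal sketch, not the exact setup used for this submission.
import mteb
from sentence_transformers import SentenceTransformer

# Assumption: the public checkpoint is sentence-transformers compatible.
model = SentenceTransformer("ibm-granite/granite-embedding-english-r2")

# Evaluate a few MTEB tasks and write result JSON files that can be
# submitted to this results repository.
tasks = mteb.get_tasks(tasks=["Banking77Classification", "SciFact", "STSBenchmark"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")
```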

@KennethEnevoldsen
Contributor

@aashka-trivedi, great to have this PR - I will mark this for review after the model has been merged in

@KennethEnevoldsen added the waiting for review of implementation label (This PR is waiting for an implementation review before merging the results.) on Aug 18, 2025
@github-actions

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: ibm-granite/granite-embedding-english-r2, ibm-granite/granite-embedding-small-english-r2
Tasks: AmazonCounterfactualClassification, AmazonPolarityClassification, AmazonReviewsClassification, AppsRetrieval, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArxivClusteringP2P, ArxivClusteringS2S, AskUbuntuDupQuestions, BIOSSES, Banking77Classification, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, COIRCodeSearchNetRetrieval, CQADupstackGamingRetrieval, CQADupstackRetrieval, CQADupstackUnixRetrieval, ClimateFEVER, ClimateFEVERHardNegatives, CodeFeedbackMT, CodeFeedbackST, CodeSearchNetCCRetrieval, CodeTransOceanContest, CodeTransOceanDL, CosQA, DBPedia, EmotionClassification, FEVER, FEVERHardNegatives, FiQA2018, HotpotQA, HotpotQAHardNegatives, ImdbClassification, LEMBNarrativeQARetrieval, LEMBNeedleRetrieval, LEMBPasskeyRetrieval, LEMBQMSumRetrieval, LEMBSummScreenFDRetrieval, LEMBWikimQARetrieval, MSMARCO, MTOPDomainClassification, MTOPIntentClassification, MassiveIntentClassification, MassiveScenarioClassification, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MindSmallReranking, MultiLongDocRetrieval, NFCorpus, NQ, QuoraRetrieval, RedditClustering, RedditClusteringP2P, SCIDOCS, SICK-R, STS12, STS13, STS14, STS15, STS16, STS17, STS22, STS22.v2, STSBenchmark, SciDocsRR, SciFact, SprintDuplicateQuestions, StackExchangeClustering, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P.v2, StackOverflowDupQuestions, StackOverflowQA, SummEval, SummEvalSummarization.v2, SyntheticText2SQL, TRECCOVID, Touche2020, Touche2020Retrieval.v3, ToxicConversationsClassification, TweetSentimentExtractionClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering.v2, TwitterSemEval2015, TwitterURLCorpus
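
For anyone reproducing a comparison like this locally, the sketch below shows one way the per-task main scores could be pulled out of MTEB result files. The directory layout and JSON schema used here are assumptions, not necessarily what the automation in this repository uses.

```python
# Hypothetical helper: collect task -> main_score for one model's result folder.
import json
from pathlib import Path

def main_scores(results_dir: str) -> dict[str, float]:
    scores: dict[str, float] = {}
    for path in Path(results_dir).glob("*.json"):
        data = json.loads(path.read_text())
        # Assumed schema: each file holds one task, with per-split lists of
        # score entries that contain a "main_score" field. Real result files
        # may require aggregating over several splits and subsets.
        entries = next(iter(data.get("scores", {}).values()), [])
        if entries:
            scores[data.get("task_name", path.stem)] = entries[0]["main_score"]
    return scores

# Example with hypothetical paths:
# new = main_scores("results/ibm-granite__granite-embedding-english-r2")
# ref = main_scores("results/intfloat__multilingual-e5-large")
```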

Results for ibm-granite/granite-embedding-english-r2

| task_name | google/gemini-embedding-001 | ibm-granite/granite-embedding-english-r2 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9269 | 0.6545 | nan | 0.9696 |
| AmazonPolarityClassification | nan | 0.6693 | 0.9326 | 0.9774 |
| AmazonReviewsClassification | nan | 0.3327 | nan | 0.6880 |
| AppsRetrieval | 0.9375 | 0.1396 | 0.3255 | 0.9375 |
| ArXivHierarchicalClusteringP2P | 0.6492 | 0.5906 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6384 | 0.5736 | 0.5367 | 0.6548 |
| ArguAna | 0.8644 | 0.5921 | 0.5436 | 0.8979 |
| ArxivClusteringP2P | nan | 0.4864 | 0.4431 | 0.6092 |
| ArxivClusteringS2S | nan | 0.4459 | 0.3843 | 0.5520 |
| AskUbuntuDupQuestions | 0.6424 | 0.6648 | 0.5924 | 0.7020 |
| BIOSSES | 0.8897 | 0.8659 | 0.8457 | 0.9692 |
| Banking77Classification | 0.9427 | 0.8556 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P | nan | 0.3889 | 0.355 | 0.5522 |
| BiorxivClusteringP2P.v2 | 0.5386 | 0.4184 | 0.372 | 0.5642 |
| BiorxivClusteringS2S | nan | 0.3718 | 0.333 | 0.5093 |
| COIRCodeSearchNetRetrieval | 0.8106 | 0.6465 | nan | 0.8951 |
| CQADupstackGamingRetrieval | 0.7068 | 0.6504 | 0.587 | 0.7861 |
| CQADupstackRetrieval | nan | 0.5 | 0.3967 | 0.6830 |
| CQADupstackUnixRetrieval | 0.5369 | 0.5285 | 0.3988 | 0.7198 |
| ClimateFEVER | nan | 0.3582 | 0.2573 | 0.5693 |
| ClimateFEVERHardNegatives | 0.3106 | 0.3597 | 0.26 | 0.4900 |
| CodeFeedbackMT | 0.5628 | 0.5254 | 0.4278 | 0.9370 |
| CodeFeedbackST | 0.8533 | 0.7718 | 0.7426 | 0.9067 |
| CodeSearchNetCCRetrieval | 0.8469 | 0.4767 | 0.7783 | 0.9635 |
| CodeTransOceanContest | 0.8953 | 0.7707 | 0.7403 | 0.9496 |
| CodeTransOceanDL | 0.3147 | 0.3503 | 0.3128 | 0.4419 |
| CosQA | 0.5024 | 0.3701 | 0.348 | 0.5218 |
| DBPedia | nan | 0.396 | 0.413 | 0.5350 |
| EmotionClassification | nan | 0.4131 | 0.4758 | 0.9387 |
| FEVER | nan | 0.8804 | 0.8281 | 0.9628 |
| FEVERHardNegatives | 0.8898 | 0.8892 | 0.8379 | 0.9453 |
| FiQA2018 | 0.6178 | 0.4632 | 0.4381 | 0.7991 |
| HotpotQA | nan | 0.6736 | 0.7122 | 0.8758 |
| HotpotQAHardNegatives | 0.8701 | 0.6708 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9498 | 0.6191 | 0.8867 | 0.9737 |
| LEMBNarrativeQARetrieval | nan | 0.4785 | 0.2422 | 0.6070 |
| LEMBNeedleRetrieval | nan | 0.43 | 0.28 | 0.8800 |
| LEMBPasskeyRetrieval | 0.3850 | 0.8175 | 0.3825 | 1.0000 |
| LEMBQMSumRetrieval | nan | 0.4158 | 0.2426 | 0.5507 |
| LEMBSummScreenFDRetrieval | nan | 0.9365 | 0.7112 | 0.9782 |
| LEMBWikimQARetrieval | nan | 0.859 | 0.568 | 0.8890 |
| MSMARCO | nan | 0.3214 | 0.437 | 0.4812 |
| MTOPDomainClassification | 0.9927 | 0.9235 | 0.9097 | 0.9995 |
| MTOPIntentClassification | nan | 0.7104 | nan | 0.9551 |
| MassiveIntentClassification | 0.8846 | 0.7056 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9208 | 0.7524 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P | nan | 0.3303 | 0.317 | 0.5153 |
| MedrxivClusteringP2P.v2 | 0.4716 | 0.3615 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S | nan | 0.3224 | 0.2976 | 0.4969 |
| MedrxivClusteringS2S.v2 | 0.4501 | 0.3574 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3295 | 0.3172 | 0.3024 | 0.3412 |
| MultiLongDocRetrieval | nan | 0.4156 | 0.3302 | 0.5099 |
| NFCorpus | nan | 0.3749 | 0.3398 | 0.5575 |
| NQ | nan | 0.5822 | 0.6403 | 0.8248 |
| QuoraRetrieval | nan | 0.8784 | 0.8926 | 0.9235 |
| RedditClustering | nan | 0.5322 | 0.4691 | 0.7716 |
| RedditClusteringP2P | nan | 0.5642 | 0.63 | 0.7527 |
| SCIDOCS | 0.2515 | 0.2495 | 0.1745 | 0.3453 |
| SICK-R | 0.8275 | 0.7134 | 0.8023 | 0.9465 |
| STS12 | 0.8155 | 0.6702 | 0.8002 | 0.9546 |
| STS13 | 0.8989 | 0.8409 | 0.8155 | 0.9776 |
| STS14 | 0.8541 | 0.7477 | 0.7772 | 0.9753 |
| STS15 | 0.9044 | 0.8537 | 0.8931 | 0.9811 |
| STS16 | nan | 0.7888 | 0.8579 | 0.9763 |
| STS17 | 0.9161 | 0.8621 | 0.8812 | 0.9586 |
| STS22 | nan | 0.685 | nan | 0.7310 |
| STS22.v2 | 0.6797 | 0.6847 | 0.6366 | 0.7497 |
| STSBenchmark | 0.8908 | 0.7917 | 0.8729 | 0.9370 |
| SciDocsRR | nan | 0.8816 | 0.8422 | 0.9114 |
| SciFact | nan | 0.758 | 0.702 | 0.8660 |
| SprintDuplicateQuestions | 0.9690 | 0.9463 | 0.9314 | 0.9787 |
| StackExchangeClustering | nan | 0.677 | 0.5837 | 0.8395 |
| StackExchangeClustering.v2 | 0.9207 | 0.5958 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P | nan | 0.3416 | 0.329 | 0.5157 |
| StackExchangeClusteringP2P.v2 | 0.5091 | 0.4014 | 0.3854 | 0.5509 |
| StackOverflowDupQuestions | nan | 0.5427 | 0.5014 | 0.6292 |
| StackOverflowQA | 0.9671 | 0.918 | 0.8889 | 0.9717 |
| SummEval | nan | 0.3152 | 0.2964 | 0.4052 |
| SummEvalSummarization.v2 | 0.3828 | 0.2931 | 0.3141 | 0.3893 |
| SyntheticText2SQL | 0.6996 | 0.4955 | 0.5307 | 0.7875 |
| TRECCOVID | 0.8631 | 0.7056 | 0.7115 | 0.9499 |
| Touche2020 | nan | 0.229 | 0.2313 | 0.3939 |
| Touche2020Retrieval.v3 | 0.5239 | 0.5343 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.8875 | 0.6208 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.6988 | 0.5256 | 0.628 | 0.8823 |
| TwentyNewsgroupsClustering | nan | 0.479 | 0.394 | 0.8349 |
| TwentyNewsgroupsClustering.v2 | 0.5737 | 0.4777 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.7917 | 0.6006 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8705 | 0.8334 | 0.8583 | 0.9571 |
| Average | 0.7275 | 0.5821 | 0.5592 | 0.7726 |
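
The Average row is presumably a per-model mean over the tasks that have a score, with missing (nan) entries skipped; "Max result" appears to be taken over a wider pool of models than the three columns shown. A minimal sketch of recomputing the averages, assuming the table were exported to a CSV with the same columns:

```python
import pandas as pd

# Hypothetical CSV export of the table above; "nan" cells parse as missing values.
df = pd.read_csv("granite-embedding-english-r2_comparison.csv", index_col="task_name")
per_model_average = (
    df.drop(index="Average", errors="ignore")      # drop the summary row if present
      .drop(columns="Max result", errors="ignore") # keep only the model columns
      .mean(skipna=True)                           # nan scores are excluded from the mean
)
print(per_model_average.round(4))
```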

Results for ibm-granite/granite-embedding-small-english-r2

| task_name | google/gemini-embedding-001 | ibm-granite/granite-embedding-small-english-r2 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9269 | 0.6178 | nan | 0.9696 |
| AmazonPolarityClassification | nan | 0.6169 | 0.9326 | 0.9774 |
| AmazonReviewsClassification | nan | 0.3215 | nan | 0.6880 |
| AppsRetrieval | 0.9375 | 0.1354 | 0.3255 | 0.9375 |
| ArXivHierarchicalClusteringP2P | 0.6492 | 0.571 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6384 | 0.5804 | 0.5367 | 0.6548 |
| ArguAna | 0.8644 | 0.544 | 0.5436 | 0.8979 |
| ArxivClusteringP2P | nan | 0.4802 | 0.4431 | 0.6092 |
| ArxivClusteringS2S | nan | 0.4407 | 0.3843 | 0.5520 |
| AskUbuntuDupQuestions | 0.6424 | 0.6483 | 0.5924 | 0.7020 |
| BIOSSES | 0.8897 | 0.865 | 0.8457 | 0.9692 |
| Banking77Classification | 0.9427 | 0.8363 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P | nan | 0.389 | 0.355 | 0.5522 |
| BiorxivClusteringP2P.v2 | 0.5386 | 0.4088 | 0.372 | 0.5642 |
| BiorxivClusteringS2S | nan | 0.3633 | 0.333 | 0.5093 |
| COIRCodeSearchNetRetrieval | 0.8106 | 0.6046 | nan | 0.8951 |
| CQADupstackGamingRetrieval | 0.7068 | 0.6244 | 0.587 | 0.7861 |
| CQADupstackRetrieval | nan | 0.4783 | 0.3967 | 0.6830 |
| CQADupstackUnixRetrieval | 0.5369 | 0.5113 | 0.3988 | 0.7198 |
| ClimateFEVER | nan | 0.3156 | 0.2573 | 0.5693 |
| ClimateFEVERHardNegatives | 0.3106 | 0.3169 | 0.26 | 0.4900 |
| CodeFeedbackMT | 0.5628 | 0.5219 | 0.4278 | 0.9370 |
| CodeFeedbackST | 0.8533 | 0.7685 | 0.7426 | 0.9067 |
| CodeSearchNetCCRetrieval | 0.8469 | 0.4842 | 0.7783 | 0.9635 |
| CodeTransOceanContest | 0.8953 | 0.7763 | 0.7403 | 0.9496 |
| CodeTransOceanDL | 0.3147 | 0.3363 | 0.3128 | 0.4419 |
| CosQA | 0.5024 | 0.3558 | 0.348 | 0.5218 |
| DBPedia | nan | 0.3785 | 0.413 | 0.5350 |
| EmotionClassification | nan | 0.3457 | 0.4758 | 0.9387 |
| FEVER | nan | 0.8648 | 0.8281 | 0.9628 |
| FEVERHardNegatives | 0.8898 | 0.8758 | 0.8379 | 0.9453 |
| FiQA2018 | 0.6178 | 0.4081 | 0.4381 | 0.7991 |
| HotpotQA | nan | 0.6565 | 0.7122 | 0.8758 |
| HotpotQAHardNegatives | 0.8701 | 0.6623 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9498 | 0.6037 | 0.8867 | 0.9737 |
| LEMBNarrativeQARetrieval | nan | 0.4132 | 0.2422 | 0.6070 |
| LEMBNeedleRetrieval | nan | 0.55 | 0.28 | 0.8800 |
| LEMBPasskeyRetrieval | 0.3850 | 0.7975 | 0.3825 | 1.0000 |
| LEMBQMSumRetrieval | nan | 0.3648 | 0.2426 | 0.5507 |
| LEMBSummScreenFDRetrieval | nan | 0.8991 | 0.7112 | 0.9782 |
| LEMBWikimQARetrieval | nan | 0.7995 | 0.568 | 0.8890 |
| MSMARCO | nan | 0.3013 | 0.437 | 0.4812 |
| MTOPDomainClassification | 0.9927 | 0.9015 | 0.9097 | 0.9995 |
| MTOPIntentClassification | nan | 0.6688 | nan | 0.9551 |
| MassiveIntentClassification | 0.8846 | 0.6708 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9208 | 0.7279 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P | nan | 0.329 | 0.317 | 0.5153 |
| MedrxivClusteringP2P.v2 | 0.4716 | 0.3646 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S | nan | 0.3261 | 0.2976 | 0.4969 |
| MedrxivClusteringS2S.v2 | 0.4501 | 0.36 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3295 | 0.3042 | 0.3024 | 0.3412 |
| MultiLongDocRetrieval | nan | 0.4007 | 0.3302 | 0.5099 |
| NFCorpus | nan | 0.3714 | 0.3398 | 0.5575 |
| NQ | nan | 0.5537 | 0.6403 | 0.8248 |
| QuoraRetrieval | nan | 0.8736 | 0.8926 | 0.9235 |
| RedditClustering | nan | 0.5024 | 0.4691 | 0.7716 |
| RedditClusteringP2P | nan | 0.5492 | 0.63 | 0.7527 |
| SCIDOCS | 0.2515 | 0.2406 | 0.1745 | 0.3453 |
| SICK-R | 0.8275 | 0.6886 | 0.8023 | 0.9465 |
| STS12 | 0.8155 | 0.6741 | 0.8002 | 0.9546 |
| STS13 | 0.8989 | 0.8057 | 0.8155 | 0.9776 |
| STS14 | 0.8541 | 0.7294 | 0.7772 | 0.9753 |
| STS15 | 0.9044 | 0.8386 | 0.8931 | 0.9811 |
| STS16 | nan | 0.7799 | 0.8579 | 0.9763 |
| STS17 | 0.9161 | 0.8518 | 0.8812 | 0.9586 |
| STS22 | nan | 0.6684 | nan | 0.7310 |
| STS22.v2 | 0.6797 | 0.6685 | 0.6366 | 0.7497 |
| STSBenchmark | 0.8908 | 0.771 | 0.8729 | 0.9370 |
| SciDocsRR | nan | 0.8754 | 0.8422 | 0.9114 |
| SciFact | nan | 0.7549 | 0.702 | 0.8660 |
| SprintDuplicateQuestions | 0.9690 | 0.9493 | 0.9314 | 0.9787 |
| StackExchangeClustering | nan | 0.6603 | 0.5837 | 0.8395 |
| StackExchangeClustering.v2 | 0.9207 | 0.5828 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P | nan | 0.3509 | 0.329 | 0.5157 |
| StackExchangeClusteringP2P.v2 | 0.5091 | 0.4068 | 0.3854 | 0.5509 |
| StackOverflowDupQuestions | nan | 0.5405 | 0.5014 | 0.6292 |
| StackOverflowQA | 0.9671 | 0.9004 | 0.8889 | 0.9717 |
| SummEval | nan | 0.287 | 0.2964 | 0.4052 |
| SummEvalSummarization.v2 | 0.3828 | 0.2674 | 0.3141 | 0.3893 |
| SyntheticText2SQL | 0.6996 | 0.4633 | 0.5307 | 0.7875 |
| TRECCOVID | 0.8631 | 0.6467 | 0.7115 | 0.9499 |
| Touche2020 | nan | 0.2417 | 0.2313 | 0.3939 |
| Touche2020Retrieval.v3 | 0.5239 | 0.5625 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.8875 | 0.5937 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.6988 | 0.5005 | 0.628 | 0.8823 |
| TwentyNewsgroupsClustering | nan | 0.4603 | 0.394 | 0.8349 |
| TwentyNewsgroupsClustering.v2 | 0.5737 | 0.4477 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.7917 | 0.5815 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8705 | 0.8277 | 0.8583 | 0.9571 |
| Average | 0.7275 | 0.5658 | 0.5592 | 0.7726 |

@KennethEnevoldsen
Contributor

No immediately concerning results here - everything looks within a reasonable range given training data annotations

@KennethEnevoldsen merged commit 42ffe7f into embeddings-benchmark:main on Aug 18, 2025
3 checks passed